Abstract: Graphical models are used to describe the conditional
independence relations in multivariate data. They have been used for a variety
of problems, including log-linear models (Liu and Massam, 2006), network
analysis (Holland and Leinhardt, 1981; Strauss and Ikeda, 1990; Wasserman
and Pattison, 1996; Pattison and Wasserman, 1999; Robins et al.,
1999), graphical Gaussian models (Roverato and Whittaker, 1998; Giudici
and Green, 1999; Marrelec and Benali, 2006), and genetics (Dobra et al.,
2004). A distribution that satisfies the conditional independence structure
of a graph is Markov. A graphical model is a family of distributions that
is restricted to be Markov with respect to a certain graph. In a Bayesian
problem, one may specify a prior over the graphical model. Such a prior is
called a hyper Markov law if the random marginals also satisfy the
independence constraints. Previous work in this area includes Dempster (1972),
Dawid and Lauritzen (1993), Giudici and Green (1999), and Letac and Massam
(2007). We explore graphical models based on a non-parametric family of
distributions, developed from Dirichlet processes.

1. Introduction

Markov distributions are multivariate measures that satisfy a specified set of
conditional independence relations. Often, an undirected graph is useful to
represent this structure. A measure is Markov with respect to a graph if, whenever
two variables have no edge between them, they are conditionally independent
given the other variables in the graph. Dawid and Lauritzen (1993) extended
this notion to the parameter space. In Bayesian statistics, the measure of the
data is random, and therefore has its own distribution called the prior. A prior
law over Markov measures is hyper Markov if it gives probability one to Markov
measures and the random marginal measures have the specified conditional
independence structure. An example is the hyper inverse Wishart distribution, which
serves as a prior for the multivariate Gaussian with known mean. The usual inverse
Wishart is a specific case, which is hyper Markov for the saturated model.

Like all parametric models, the hyper inverse Wishart prior makes strong 
assumptions about the shape of the distribution. In many applications, such 
assumptions are undesirable. In contrast, non-parametric models make weak 
assumptions. Typical assumptions include continuity and the existence of some 
number of derivatives. For example, one may specify that the distribution is 
smooth, having derivatives of all orders. The current paper aims to apply the 
non-parametric approach to graphical models. We achieve this by following the 
framework laid by Dawid and Lauritzen. We begin with the Dirichlet Process, 
a commonly used non-parametric prior law. We then describe how to build this 
family into a non-parametric hyper Markov prior. 

As in Dawid and Lauritzen (1993), we restrict our attention to decomposable 
graphs. The benefit of this is that a decomposable graph can be easily built 
up from smaller components called cliques which intersect to form the entire 
graph. Dawid and Lauritzen begin by considering a base distribution for each 
clique. The only requirement is that these distributions agree where the cliques 
intersect. They weave together base distributions by taking the base measure
of one clique as its marginal, and conditioning the second clique on the
intersection. The third clique is added by conditioning its base measure on the
intersection with the previous two cliques. This process is repeated until all
the cliques have been combined. The end result is a Markov distribution whose
marginal over each clique is the clique's base measure. For a prior on Markov
distributions, Dawid and Lauritzen construct a hyper Markov law in the same way.

As an example of the Dawid and Lauritzen (1993) construction, consider 
the problem of estimating the covariance matrix of a multivariate Gaussian. 
If we believe that the data exhibit some conditional independence structure, 
this implies certain constraints on the covariance matrix. Speed and Kiiveri
(1986) showed that the sufficient statistics are the component covariance matrices
belonging to each clique. The inverse Wishart is the usual prior for the saturated 
model which has no constraints on the covariance matrix. In a non-saturated 
model, the sub-matrix of each clique is unconstrained, except that the
sub-matrices must agree where their indices intersect. For this reason, the inverse
Wishart is the natural choice as the base measure for each clique. The sub-matrix
for the first clique has an inverse Wishart prior. If the graph is connected and the
cliques have a perfect ordering (see Section 2), then the first and second
sub-matrices have some elements in common. Thus, the sub-matrix for the second
clique is the inverse Wishart, conditional on knowing some of the elements. By
repeating the conditioning for each clique, the hyper inverse Wishart is defined.

In the current paper, we apply this framework to non-parametric priors.
Instead of the inverse Wishart, the Dirichlet process prior is the base measure for
each clique. Following the analogy, we build the marginals into a hyper Markov
prior, which we refer to as the hyper Dirichlet process. The Dirichlet process
is a special case of tail-free processes (Ferguson, 1973). Dirichlet processes have
been used for non-parametric priors in many areas, including block models (Bush
and MacEachern, 1996), survival analysis (Susarla and Ryzin, 1976; Ghosh and
Ramamoorthi, 1995; Kim and Lee, 2001), and non-stationary point processes
(Pievatolo and Rotondi, 2000). These are all areas that could potentially use a
hyper Dirichlet process in multidimensional problems. In section 2, we explain
notation and formalize some of the ideas presented so far. In section 3, we
describe the Dirichlet process and some previous results. In section 4 we weave
Dirichlet processes on the cliques to build the hyper Dirichlet process, and show
that it is a hyper Markov prior. Finally, we explore applications for this
framework in section 5.

2. Notation and Setting 

Throughout this paper we consider a graph, $\mathcal{G}$, with vertex set $V$ and edge set
$E$. By convention, we assume that $(\gamma, \gamma) \in E$ for all $\gamma \in V$. We call such
edges loops. There is no practical difference if loops are excluded from $E$, though
some minor changes are required for certain definitions. If $A \subseteq V$, then
$\mathcal{G}_A$ is the subgraph of $\mathcal{G}$ over $A$. The subgraph $\mathcal{G}_A$ has vertex set $A$ and
edge set $E_A = (A \times A) \cap E$. We say that $A$ induces the subgraph $\mathcal{G}_A$. If
$E_A = A \times A$, then $\mathcal{G}_A$ is complete. A clique is a set $A$ such that $\mathcal{G}_A$ is
complete and, for any proper superset $B \supset A$, $\mathcal{G}_B$ is not complete. For example,
if $\mathcal{G}$ itself is complete, then there is one clique, viz. $V$.

A $k$-path is a sequence $(\gamma_0, \gamma_1, \ldots, \gamma_k)$ such that
$(\gamma_i, \gamma_{i+1}) \in E$ for $0 \le i < k$. If $A$ and $B$ are subsets of $V$, then a
path between them is a path between any $a \in A$ and any $b \in B$. A graph is
connected if there exists a path between every pair of subsets. A third subset
$C \subseteq V$ is said to separate $A$ and $B$ if every path between them contains an
element of $C$. A $k$-cycle is a path such that $k \ge 3$, $\gamma_0 = \gamma_k$, and the other
elements are distinct. A graph is decomposable if every cycle of length greater
than 3 has a chord, that is, an edge between two non-consecutive vertices of the
cycle. A decomposable graph admits a perfect ordering of its cliques.

Definition 1 Perfect Ordering. Suppose a graph $\mathcal{G}$ has $n$ cliques. Let the
cliques have an arbitrary ordering $C_1, \ldots, C_n$. Define $H_k = \bigcup_{i=1}^{k} C_i$.
For $k \ge 2$ define $S_k = C_k \cap H_{k-1}$ and $R_k = C_k \setminus H_{k-1}$. The ordering
of the cliques is a perfect ordering if for each $2 \le k \le n$, there exists
$j_k < k$ such that $S_k \subseteq C_{j_k}$.

The sets $H_k$ are called the histories. The separators, $S_k$, separate $C_k$ from
the previous history. The sets $R_k$ are called the residuals, which represent the
new nodes being added to the history. In a perfect ordering, each new clique is
separated from the current set of nodes by a single one of the earlier cliques.
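
The condition in Definition 1 can be checked mechanically. The following is a minimal sketch, not taken from the paper: the function names and the two example clique lists are hypothetical, and cliques are represented simply as Python sets of vertex labels.

```python
def histories_separators_residuals(cliques):
    """Return a list of (H_k, S_k, R_k) for the cliques in the given order.

    For the first clique the separator is empty and the residual is the clique itself.
    """
    out = []
    history = set()
    for clique in cliques:
        clique = set(clique)
        separator = clique & history   # S_k = C_k intersected with H_{k-1}
        residual = clique - history    # R_k = C_k minus H_{k-1}
        history = history | clique     # H_k = H_{k-1} united with C_k
        out.append((set(history), separator, residual))
    return out


def is_perfect_ordering(cliques):
    """Check that every separator S_k (k >= 2) lies inside one earlier clique."""
    triples = histories_separators_residuals(cliques)
    for k in range(1, len(cliques)):
        separator = triples[k][1]
        if not any(separator <= set(cliques[j]) for j in range(k)):
            return False
    return True


# Hypothetical example: a chain a - b - c - d, whose cliques are its edges.
chain = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
# Hypothetical counter-example: the 4-cycle a - b - c - d - a (not decomposable).
cycle = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d", "a"}]

print(is_perfect_ordering(chain))
print(is_perfect_ordering(cycle))
```

In the cycle example the last clique meets the history in {a, d}, which is not contained in any single earlier clique, so the ordering fails the definition.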

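For every $\gamma \in V$, $X_\gamma$ is a random variable taking values in the space
$(\mathcal{X}_\gamma, \mathcal{F}_\gamma)$. In this sense, we consider $V$ an index set of components of
some random variable $X = (X_\gamma : \gamma \in V)$. We denote the range and $\sigma$-field of $X$
by $(\mathcal{X}, \mathcal{F}) = (\times_{\gamma \in V} \mathcal{X}_\gamma, \times_{\gamma \in V} \mathcal{F}_\gamma)$. Furthermore, we
extend these definitions to subsets, $A \subseteq V$:

$X_A = (X_\gamma : \gamma \in A)$

$\mathcal{X}_A = \times_{\gamma \in A} \mathcal{X}_\gamma$

Let $\alpha$ be a finite measure over some $\mathcal{X}_A$; then $\bar\alpha = \alpha / \alpha(\mathcal{X}_A)$. In other
words, $\bar\alpha$ is the probability measure proportional to $\alpha$. If $B \subseteq A$, then
$\alpha_B$ is the marginal of $\alpha$ over $\mathcal{X}_B$. Thus, $\alpha_B(U) = \alpha(U \times \mathcal{X}_{A \setminus B})$ for all
$U \in \mathcal{F}_B$. For convenience, if $\gamma \in A$, we may write $\gamma$ for $\{\gamma\}$. If $\alpha$ and
$\beta$ are both measures on some space $(\mathcal{X}, \mathcal{F})$, then we define their sum,
$\alpha + \beta$, by

$[\alpha + \beta](U) = \alpha(U) + \beta(U)$, for all $U \in \mathcal{F}$.

If $x \in \mathcal{X}$, then the delta measure $\delta_x$ is a point mass concentrated at $x$:
$\delta_x(U) = 1$ if $x \in U$, and $\delta_x(U) = 0$ otherwise.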

2.1. Graph Selection 

For the remainder of the paper, we consider undirected graphs, which implies
that $(i,j) \in E$ if and only if $(j,i) \in E$. We also assume that the graph is
connected and decomposable. An undirected graph depicts the conditional
independence structure for some variable $X$ in the following sense:

$X_A \perp\!\!\!\perp X_B \mid X_C$ whenever $C$ separates $A$ and $B$. (1)

Definition 2 Markov Probability Measure. If $\theta$ is a probability measure
on $(\mathcal{X}, \mathcal{F})$, we say it is Markov on a decomposable graph, $\mathcal{G}$, if $X \sim \theta$
satisfies the conditional independences in $\mathcal{G}$.



Fig 1. A graph depicting conditional independence of I and K given J. 

Example 1 Let $\mathcal{G}$ be the graph depicted in Figure 1. A measure $\theta$ is Markov
on $\mathcal{G}$ if and only if $X_I \perp\!\!\!\perp X_K \mid X_J$ whenever $X \sim \theta$.

Implicit in the definition is the fact that it is only sensible to refer to a
measure as Markov in relation to a specific graph. For example, if the measure
$\theta$ is not Markov on $\mathcal{G}$ in Example 1, it is still Markov on the saturated graph
with $V = \{I, J, K\}$. All measures over $\mathcal{X}_V$ are trivially Markov on the saturated
graph since there are no constraints on conditional independence. Furthermore,
if $\mu$ is a measure such that the $X_\gamma$ are mutually independent, then it is Markov
on any graph (with the appropriate vertex set). We denote the set of all
distributions that are Markov on $\mathcal{G}$ by $\mathcal{M}(\mathcal{G})$.

It will be useful to keep Figure 1 in mind throughout this paper. While the 
graph technically has only three variables, it is representative of any connected 
graph of two cliques. Instead of one variable, imagine $I$, $J$, and $K$ to contain
multiple variables, with $J$ being the variables that belong to both cliques. $I$ is
the set of variables in one clique but not the other, and $K$ vice versa.

Let $X \sim P \in \mathcal{P}$ be a random variable whose distribution is modeled by
some family of probability distributions, $\mathcal{P}$. In some applications, the focus is
not on determining $P$, but on discovering the independence structure of $X$. A graph
of this structure, $\mathcal{G}$, denotes the belief that $P$ is Markov with respect to $\mathcal{G}$.
Thus, it restricts the model to a sub-family, $\mathcal{P}_\mathcal{G} = \mathcal{P} \cap \mathcal{M}(\mathcal{G})$. Graph
selection is the problem of determining the smallest $\mathcal{P}_\mathcal{G}$ that contains $P$. The
most prevalent examples are graphical Gaussian models. Graph selection for
Gaussian models is often called covariance selection. In this setting, the relevant
family is the set of $p$-variate Gaussian distributions. Denote this family
$\mathcal{N} = \{N_p(\mu, \Sigma) : \mu \in \mathbb{R}^p, \Sigma \in M^+\}$, where $M^+$ is the cone of
real-valued, symmetric $p \times p$ matrices that are positive definite. Specifying a
graph, $\mathcal{G}$, translates to putting constraints on $\Sigma$. For example, if
$(x_1, x_2, x_3)$ is such that $x_1 \perp\!\!\!\perp x_3 \mid x_2$, then $\sigma_{13}$ is no longer a free
parameter, but a function of $\sigma_{11}$, $\sigma_{22}$, $\sigma_{33}$, $\sigma_{12}$, and $\sigma_{23}$. In
general, denote the sub-family of Gaussian distributions Markov on $\mathcal{G}$ by
$\mathcal{N}_\mathcal{G}$. Let $P_\mathcal{G}$ be the set of positive definite matrices $K$ such that
$K_{ij} = 0$ for all $(i, j) \notin E$; let $Q_\mathcal{G}$ be the image of $P_\mathcal{G}$ under matrix
inversion. Speed and Kiiveri (1986) showed that if $N_p(\mu, \Sigma_\mathcal{G})$ is Markov with
respect to $\mathcal{G}$, then $\Sigma_\mathcal{G} \in Q_\mathcal{G}$. Thus
$\mathcal{N}_\mathcal{G} = \{N_p(\mu, \Sigma) : \mu \in \mathbb{R}^p, \Sigma \in Q_\mathcal{G}\}$. The goal of covariance
selection is to find the smallest $Q_\mathcal{G}$ containing $\Sigma$, the population covariance
matrix.
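
The three-variable example can be verified numerically. The sketch below is an illustration rather than part of the paper: the precision matrix is a made-up example with a zero in the (1,3) position, and the check relies on the standard Gaussian identity that conditional independence of $x_1$ and $x_3$ given $x_2$ is equivalent to $\sigma_{13} = \sigma_{12}\sigma_{23}/\sigma_{22}$.

```python
import numpy as np

# Hypothetical precision matrix for the chain 1 - 2 - 3.
# The zero in position (1, 3) encodes the missing edge between x_1 and x_3.
K = np.array([
    [2.0, -0.8, 0.0],
    [-0.8, 2.0, -0.5],
    [0.0, -0.5, 1.5],
])

# K must be positive definite to be a valid precision matrix.
assert np.all(np.linalg.eigvalsh(K) > 0)

# The covariance matrix is the inverse of the precision matrix.
Sigma = np.linalg.inv(K)

# Value of sigma_13 implied by the other entries under x_1 indep. of x_3 given x_2.
sigma13_implied = Sigma[0, 1] * Sigma[1, 2] / Sigma[1, 1]

print(Sigma[0, 2], sigma13_implied)
```

The two printed values agree up to floating-point error, so in this example $\sigma_{13}$ is determined by the remaining entries of $\Sigma$, as the text describes.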

Much progress has been made with graph selection for parametric models. 
Dawid and Lauritzen (1993) proved many results for decomposable graphical 
models, including multinomial and multivariate Gaussian problems. For example,
they present the distribution of the restricted maximum likelihood estimate
of $\Sigma$ for the $\mathcal{N}_\mathcal{G}$ model with $\mu$ known. This distribution is called the hyper
Wishart distribution since it reduces to the Wishart when $\mathcal{G}$ is complete. Letac
and Massam (2007) have extended the hyper Wishart to a richer family of
distributions on $Q_\mathcal{G}$ and $P_\mathcal{G}$. Giudici and Green (1999) implemented a reversible
jump Markov chain Monte Carlo algorithm for determining $\mathcal{G}$.

The family of hyper inverse Wishart distributions is the subset of Markov
distributions such that each clique marginal is inverse Wishart. Carvalho et al.
(2007) provide an algorithm for generating random variables from this family.
For decomposable models, the presence of a perfect ordering simplifies the
process. For two cliques, the algorithm begins by generating an inverse Wishart
variable on one clique. If the cliques overlap, this determines some of the
parameters for the other clique. Therefore, one needs to generate a conditional
Wishart variable given those entries. For multiple cliques, one simply repeats
this process. With a perfect ordering, the process is simplified because each
new clique is conditioned on only one previous clique. Conditioning on multiple
cliques can lead to moderate complications in the conditional distribution.
Hence, decomposable models are computationally convenient.

3. The Dirichlet Process 

The Dirichlet process (Ferguson, 1973) is a special case of tail-free distributions.
It is a prior, meaning that it provides a distribution over the space of
probability distributions on $(\mathcal{X}, \mathcal{F})$. In this paper, we use the term law to refer
to distributions over probability measures. However, this terminology is merely a
convenience; the words "law" and "distribution" are typically interchangeable.
The Dirichlet process is an example of a non-parametric law, which means that
it cannot be specified by a finite-dimensional parameter. In this section, the
Dirichlet process is introduced and some of its useful properties are given. This
leads into the next section, in which we show how to compose a hyper Dirichlet
process from multiple Dirichlet processes.

Definition 3 Dirichlet Process. Let $A$ be any subset of $V$. Let $\alpha$ be a
finite measure over $(\mathcal{X}_A, \mathcal{F}_A)$, and let $\theta$ be a random probability measure over
the same space. We say that the distribution of $\theta$ is a Dirichlet process with base
measure $\alpha$, and write $\theta \sim DP_\alpha$, if

$(\theta(A_1), \theta(A_2), \ldots, \theta(A_k)) \sim \mathrm{Dir}(\alpha(A_1), \alpha(A_2), \ldots, \alpha(A_k))$, (2)

whenever $(A_i)_{i=1}^{k}$ is a measurable partition of $\mathcal{X}_A$.

This definition leads to some useful properties, described in the following
theorem.

Theorem 4 Posterior Dirichlet Process. Let $\theta \sim DP_\alpha$ and, given $\theta$, let
$X_1, \ldots, X_n$ be an iid sample from $\theta$.

(i) $X_i \sim \bar\alpha$, for all $i$.
(ii) $\theta \mid (X_1, \ldots, X_n) \sim DP_{\alpha'}$, where $\alpha' = \alpha + \sum_{i=1}^{n} \delta_{X_i}$.

See Theorem 1.9.4 of Schervish (1995), p. 54. □

The first property states that if the random measure is integrated out, the
marginal distribution of the data is $\bar\alpha$. This property ensures that a Markov
base measure implies that the Dirichlet process, integrated over all possible $\theta$,
is a Markov distribution. This does not guarantee that the process is a hyper
Markov law. That requires the stronger condition that $\theta \sim DP_\alpha$ is a Markov
distribution with probability one. The second property states that if a Dirichlet
process is used as a prior measure, then the posterior measure is also a Dirichlet
process, with an easily updated base measure. This fact helps determine which
properties of a prior will persist in the posterior.

If the prior law of $\theta$ is a Dirichlet process, then the various marginal
distributions of $\theta$ will also have a Dirichlet process law. This is expressed in the
following theorem.

Theorem 5 Marginal of a Dirichlet Process. Let $\theta \sim DP_\alpha$ be a random
probability measure on $(\mathcal{X}_A, \mathcal{F}_A)$. For $B \subseteq A$, the marginal of $\theta$ over $B$ is
$\theta_B \sim DP_{\alpha_B}$.

Proof: Define $A' = A \setminus B$. Let $B_1, B_2, \ldots, B_k$ be a measurable partition of
$\mathcal{X}_B$. Then

$(\theta_B(B_1), \ldots, \theta_B(B_k)) = (\theta(B_1 \times \mathcal{X}_{A'}), \ldots, \theta(B_k \times \mathcal{X}_{A'}))$
$\sim \mathrm{Dir}(\alpha(B_1 \times \mathcal{X}_{A'}), \ldots, \alpha(B_k \times \mathcal{X}_{A'}))$
$= \mathrm{Dir}(\alpha_B(B_1), \ldots, \alpha_B(B_k))$. □



We proceed by showing how the Dirichlet process can be used as a
non-parametric prior (see Ferguson (1973) for details). Let $F$ be an unknown
cumulative distribution function that we wish to estimate. For simplicity, we
consider a one-dimensional random variable. Let $\pi = DP_\alpha$ be the prior law. Let
the loss function be a squared error loss. Then the Bayes risk is

$R(F, \hat F) = \int E\,(F(t) - \hat F(t))^2 \, dt.$ (3)

The risk is minimized by setting $\hat F(t)$ to $E F(t)$, where the expectation is
relative to the posterior distribution. If we observe data $X_1, X_2, \ldots, X_n$, then
the posterior is $DP_{\alpha'}$, where $\alpha' = \alpha + \sum_{i=1}^{n} \delta_{X_i}$ (see Theorem 4). The
posterior distribution of $F(t) = \theta((-\infty, t])$ is
$\mathrm{Beta}(\alpha'((-\infty, t]), \alpha'((t, \infty)))$. Therefore,

$E F(t) = \frac{\alpha'((-\infty, t])}{\alpha'((-\infty, t]) + \alpha'((t, \infty))}$ (4)

$= \frac{\alpha((-\infty, t]) + \sum_{i=1}^{n} 1(X_i \le t)}{\alpha(\mathcal{X}) + n}.$ (5)

The Bayes estimate can be written as a weighted sum of two estimates,

$\hat F(t) = E F(t)$ (6)

$= (1 - w)\,\bar\alpha((-\infty, t]) + w\,F_n(t),$ (7)

where $\bar\alpha((-\infty, t])$ is the prior estimate, $F_n(t)$ is the empirical cdf, and
$w = n(\alpha(\mathcal{X}) + n)^{-1}$ is the weight of the data. This convex combination of a prior
estimate and frequentist estimate is common in Bayesian analysis. It shows
the role of the base measure in the Dirichlet process: $\bar\alpha$ is the prior guess about
the shape of the unknown distribution, and $\alpha(\mathcal{X})$ is mathematically equivalent to
the prior sample size.
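
The equivalence of Equations 5 and 7 can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the function name, the uniform prior guess on [0, 1], the prior mass of 2, and the four data points are all hypothetical choices made for illustration.

```python
import numpy as np


def dp_posterior_mean_cdf(t, data, prior_cdf, prior_mass):
    """Posterior mean of F(t) under a Dirichlet process prior (Equation 5).

    prior_cdf is the normalized base measure evaluated on (-inf, t];
    prior_mass is the total mass of the base measure, alpha(X).
    """
    data = np.asarray(data, dtype=float)
    n = data.size
    alpha_t = prior_mass * prior_cdf(t)   # alpha((-inf, t])
    count_t = np.sum(data <= t)           # number of observations <= t
    return (alpha_t + count_t) / (prior_mass + n)


def uniform_cdf(t):
    """Hypothetical prior guess: the uniform distribution on [0, 1]."""
    return float(np.clip(t, 0.0, 1.0))


data = [0.1, 0.4, 0.4, 0.9]   # made-up observations
prior_mass = 2.0              # made-up prior sample size alpha(X)
t = 0.5

estimate = dp_posterior_mean_cdf(t, data, uniform_cdf, prior_mass)

# The same estimate written as the convex combination in Equation 7.
w = len(data) / (prior_mass + len(data))
empirical = np.mean(np.asarray(data) <= t)
convex = (1 - w) * uniform_cdf(t) + w * empirical

print(estimate, convex)
```

With these numbers the weight of the data is 4/6, and both expressions evaluate to 2/3.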

3.1. Dirichlet Process as a Stick-Breaking Prior

A stick-breaking process is an almost surely discrete random probability
measure, $\theta$, that can be expressed as

$\theta(\cdot) = \sum_{k=1}^{N} w_k \delta_{Z_k}(\cdot),$ (8)

where the $Z_k$ are independently distributed atoms from some distribution $G$,
and $\sum_{k=1}^{N} w_k = 1$ almost surely. The number of atoms, $N$, may be finite or
infinite. The weights are determined by successively breaking random pieces of
a unit-length stick. Thus, $w_1 = p_1$, $w_2 = (1 - p_1)p_2$, and
$w_k = p_k \prod_{i=1}^{k-1} (1 - p_i)$. For finite $N$, $w_N$ is defined by
$1 - \sum_{i=1}^{N-1} w_i$, or equivalently by $\prod_{i=1}^{N-1} (1 - p_i)$.
Traditionally, stick-breaking measures are defined such that $p_k$ is a
$\mathrm{Beta}(a_k, b_k)$ random variable for $1 \le k < N$. Thus, a stick-breaking measure
is specified by a probability distribution $G$ and a countable sequence of Beta
parameters $(a_k, b_k)_{k \ge 1}$. Sethuraman (1994) showed that a Dirichlet process is
a stick-breaking measure with $Z_k \sim \bar\alpha$ and $(a_k, b_k) = (1, \alpha(\mathcal{X}))$ for all
$k \in \mathbb{N}$. This relationship leads to an alternative definition of the Dirichlet
process.

Definition 6 Dirichlet Process (alternate definition). Let $A$ be any subset
of $V$. Let $G$ be a probability measure on $(\mathcal{X}_A, \mathcal{F}_A)$, and let $\theta$ be a random
probability measure over the same space. For $\nu > 0$, we say that the distribution of
$\theta$ is a Dirichlet process with base distribution (or measure) $G$ and precision $\nu$,
and write $\theta \sim DP(\nu G)$, if

$(\theta(A_1), \theta(A_2), \ldots, \theta(A_k)) \sim \mathrm{Dir}(\nu G(A_1), \nu G(A_2), \ldots, \nu G(A_k))$, (9)

whenever $(A_i)_{i=1}^{k}$ is a measurable partition of $\mathcal{X}_A$.

Note that this definition is equivalent to Definition 3 by letting $\alpha = \nu G$.
In particular, $\nu$ is equivalent to the prior sample size, and $G$ is equivalent to the
prior mean. In this definition, $\nu$ and $G$ are easily translated as the parameters
of a stick-breaking measure. That is, the random atoms are iid $G$, and
$p_k \sim \mathrm{Beta}(1, \nu)$ for all $k \in \mathbb{N}$. Because the stick-breaking representation is
useful for many of the theorems we prove, Definition 6 will be the definition of
choice for much of the current paper.
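
The stick-breaking representation translates directly into a sampler. The sketch below is an approximation, not the exact process: it truncates the infinite sequence at a finite number of atoms and gives the last atom whatever stick remains, as described above for finite $N$. The function name, the standard normal base distribution, and the truncation level are hypothetical choices.

```python
import numpy as np


def truncated_stick_breaking(nu, base_sampler, n_atoms, rng):
    """Draw weights and atoms of a stick-breaking measure truncated at n_atoms.

    nu is the precision; base_sampler(rng, n) draws n iid atoms from G.
    """
    # Break proportions p_k ~ Beta(1, nu).
    p = rng.beta(1.0, nu, size=n_atoms)
    # Truncation: the final atom receives all of the remaining stick.
    p[-1] = 1.0
    # Length of stick remaining before the k-th break: prod_{i<k} (1 - p_i).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - p[:-1])))
    weights = p * remaining
    atoms = base_sampler(rng, n_atoms)
    return weights, atoms


rng = np.random.default_rng(0)
weights, atoms = truncated_stick_breaking(
    nu=2.0,
    base_sampler=lambda r, n: r.standard_normal(n),  # hypothetical base G = N(0, 1)
    n_atoms=50,
    rng=rng,
)

# One realization of the random measure, evaluated on the set (-inf, 0].
print(weights.sum(), weights[atoms <= 0.0].sum())
```

A small precision concentrates most of the mass on the first few atoms, while a large precision spreads it over many atoms, so the truncation level should grow with $\nu$.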

The previous theorems regarding Dirichlet processes can be expressed using
$\nu G$ notation. For example, we rewrite Theorem 4 regarding the posterior
Dirichlet process.

Theorem 7 Posterior Dirichlet Process (alternate). Let $\theta \sim DP(\nu G)$
and, given $\theta$, let $X_1, \ldots, X_n$ be an iid sample from $\theta$.

(i) $X_i \sim G$, for all $i$.
(ii) $\theta \mid (X_1, \ldots, X_n) \sim DP(\nu' G')$, where $\nu' = \nu + n$ and
$G' = (\nu + n)^{-1} (\nu G + \sum_{i=1}^{n} \delta_{X_i})$.

In the following section we introduce the hyper Dirichlet process and show 
that it is an example of a stick-breaking measure. We then use Equation 8 to 
prove some of its properties. While we focus on the hyper Dirichlet Process for 
simplicity and concreteness, many of the results apply to other stick-breaking 
processes as well. 

4. The Hyper Dirichlet Process 

Consider a multivariate variable $X$ with distribution $\theta$. Suppose that we know
little about $\theta$, other than it is Markov on some decomposable graph, $\mathcal{G}$. In this
case we may wish to specify a non-parametric prior for $\theta$. For example, we focus
on the Dirichlet process. There are two main difficulties with this approach. The
first is the elicitation of a proper base measure. The second is ensuring that the
Dirichlet process gives probability one to $\mathcal{M}(\mathcal{G})$. Both concerns are addressed
by using a framework that we dub the hyper Dirichlet process.

To define a hyper Dirichlet process, we begin by eliciting a base measure for
each clique in $\mathcal{G}$. Hopefully, this is simpler than eliciting a base measure for the
entire graph at once. These base measures are combined to form a base measure
over the entire graph. We define these combinations in a way which will ensure
that the support of the process lies within the set of Markov distributions on
$\mathcal{G}$. In the remainder of this section, we provide details of this method and show
that it satisfies the Markov property.

4.1. Markov Combinations of Probability Measures

Dawid and Lauritzen (1993) show that if two subsets of $V$ are each endowed
with a marginal probability measure, then there is a logical choice for their joint
distribution, provided the marginals satisfy a consistency condition.

Definition 8 Consistency (of probability measures). Suppose $A, B \subseteq V$.
Let $\mu$ and $\lambda$ be probability measures on $(\mathcal{X}_A, \mathcal{F}_A)$ and $(\mathcal{X}_B, \mathcal{F}_B)$, respectively.
We say that $\mu$ and $\lambda$ are consistent if they induce the same marginal over
$\mathcal{X}_{A \cap B}$.

Note that $\mu$ and $\lambda$ are consistent if and only if

$\mu(\mathcal{X}_{A \setminus B} \times U) = \lambda(\mathcal{X}_{B \setminus A} \times U)$ for all $U \in \mathcal{F}_{A \cap B}$.

Theorem 9 Suppose $\mu$ on $(\mathcal{X}_A, \mathcal{F}_A)$ and $\lambda$ on $(\mathcal{X}_B, \mathcal{F}_B)$ are consistent
probability measures, with $A, B \subseteq V$. There exists an almost-everywhere unique
distribution, $\alpha$, such that:

(i) $\alpha_A = \mu$,
(ii) $\alpha_B = \lambda$,
(iii) $\alpha \in \mathcal{M}(\mathcal{G}_{A \cup B})$.

Proof: Construct $\alpha$ such that its marginal over $\mathcal{X}_A$ is $\mu$, so that condition
(i) is satisfied. Specify its conditional distributions over $\mathcal{X}_B$ given $\mathcal{X}_A$ to be
the same as the conditional distributions of $\lambda$ given $\mathcal{X}_{A \cap B}$. This ensures that
(iii) holds as well. Let $C = A \cap B$ and $B' = B \setminus A$. Then for any $U \in \mathcal{F}_{B'}$ and
$V \in \mathcal{F}_C$,

$\alpha_B(U \times V) = \lambda_{B'|C}(U \mid V)\, \alpha_C(V)$
$= \lambda_{B'|C}(U \mid V)\, \mu_C(V)$
$= \lambda_{B'|C}(U \mid V)\, \lambda_C(V)$
$= \lambda(U \times V).$

The second equation follows from the construction of $\alpha$. The third equation
is ensured since $\mu$ and $\lambda$ are consistent. Hence, condition (ii) is also satisfied.
Furthermore, the conditional distributions are unique, except over some subset
of $\mathcal{X}_C$ with zero measure under $\lambda$, and hence also under $\mu$ by consistency.
Therefore, this construction gives (a version of) the unique distribution satisfying
the conditions. □

Definition 10 Markov Combination (of probability measures). Let $\mu$
and $\lambda$ be as in Theorem 9. We call the unique distribution satisfying (i)-(iii) the
Markov combination of $\mu$ and $\lambda$, and denote it by $\mu \star \lambda$.
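
For discrete variables, the construction in the proof of Theorem 9 reduces to a single array operation. The following sketch is illustrative only: it uses the two-clique layout of Figure 1, with a table over $(I, J)$ and a table over $(J, K)$; the function name and the numerical tables are hypothetical.

```python
import numpy as np


def markov_combination(mu, lam):
    """Markov combination of two consistent discrete tables.

    mu has shape (n_I, n_J) and lam has shape (n_J, n_K); J is the separator.
    Returns alpha with shape (n_I, n_J, n_K), where
    alpha[i, j, k] = mu[i, j] * lam[j, k] / lam_J[j].
    """
    mu = np.asarray(mu, dtype=float)
    lam = np.asarray(lam, dtype=float)
    mu_J = mu.sum(axis=0)    # marginal of mu over the separator
    lam_J = lam.sum(axis=1)  # marginal of lam over the separator
    if not np.allclose(mu_J, lam_J):
        raise ValueError("mu and lam are not consistent on the separator")
    # Conditional of K given J under lam; rows with zero mass are set to zero.
    cond_K_given_J = np.divide(
        lam, lam_J[:, None], out=np.zeros_like(lam), where=lam_J[:, None] > 0
    )
    return mu[:, :, None] * cond_K_given_J[None, :, :]


# Hypothetical consistent marginals: both put mass (0.4, 0.6) on J.
mu = np.array([[0.1, 0.2],
               [0.3, 0.4]])
lam = np.array([[0.3, 0.1],
                [0.2, 0.4]])

alpha = markov_combination(mu, lam)

print(np.allclose(alpha.sum(axis=2), mu))   # condition (i): marginal over (I, J)
print(np.allclose(alpha.sum(axis=0), lam))  # condition (ii): marginal over (J, K)
```

Condition (iii) holds by construction: for each fixed value of $J$, the slice `alpha[:, j, :]` is an outer product, which is exactly conditional independence of $I$ and $K$ given $J$.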

Now suppose $\mathcal{G}$ has a perfect ordering of cliques $(C_1, C_2, \ldots, C_k)$, and that
each clique $C_i$ is imbued with a marginal probability distribution $P_i$. Further
suppose that $P_i$ and $P_j$ are consistent for all $i, j$. Each clique is consistent with
the previous history regarding the separator, since the separator is contained by
a single previous clique. Using the idea of a Markov combination iteratively, we
stitch together a distribution that is Markov on $\mathcal{G}$ and has the given marginals.
Define $G_1 = P_1$, and $G_i = G_{i-1} \star P_i$ for $i \ge 2$. Dawid and Lauritzen show that
$G = G_k$ is the unique Markov distribution satisfying $G_{C_i} = P_i$ for each $i$. We
call $G$ the Markov combination of $P_1, \ldots, P_k$. In general, we may write
$\star(Q_1, \ldots, Q_k)$ to indicate a Markov combination with the understanding that the
cliques are perfectly ordered and $Q_1, \ldots, Q_k$ are pairwise consistent.

4.2. Markov Combinations of Finite Measures

Using Markov combinations, we are able to take probability distributions and
build a distribution over the entire graph. The base measure of a Dirichlet
process, however, is not necessarily a probability distribution. Therefore, we
proceed by extending Markov combinations to finite measures in general. For
probability measures, we required the marginals over $\mathcal{X}_{A \cap B}$ to be the same.
We simply extend this definition to any finite measure.

Definition 11 Consistency of Finite Measures. Let $\mu$ be a finite measure
over $(\mathcal{X}_A, \mathcal{F}_A)$ and $\lambda$ be a finite measure over $(\mathcal{X}_B, \mathcal{F}_B)$. We say that $\mu$ and
$\lambda$ are consistent if they induce the same marginal measure over $A \cap B$. That is,
$\mu$ and $\lambda$ are consistent if

$\mu(\mathcal{X}_{A \setminus B} \times U) = \lambda(\mathcal{X}_{B \setminus A} \times U)$ for all $U \in \mathcal{F}_{A \cap B}$. (10)

Recall that $\bar\mu$ is the probability measure proportional to $\mu$. Equation 10 holds
if and only if the following two conditions are satisfied:

1. $\bar\mu$ and $\bar\lambda$ are consistent.

2. $\mu(\mathcal{X}_A) = \lambda(\mathcal{X}_B)$.

Consider these two conditions in the context of base measures for Dirichlet
processes. $\bar\mu$ is the prior guess about the probability distribution of $X_A$, and
$\bar\lambda$ is the prior guess for $X_B$. The first condition therefore states that the
priors must agree about the distribution of $X_{A \cap B}$. It is reasonable to require
that our prior is coherent in this way. The second condition states that the prior
sample sizes for both sets of variables must be equal. This constraint is perhaps
less desirable. It would be perfectly logical to be more certain about some
dimensions than others. Unfortunately, any measure $\alpha$ on $\mathcal{X}_{A \cup B}$ must satisfy

$\alpha_A(\mathcal{X}_A) = \int_{\mathcal{X}_A} d\alpha_A = \int_{\mathcal{X}_{A \cup B}} d\alpha = \alpha(\mathcal{X}_{A \cup B}) = \int_{\mathcal{X}_{A \cup B}} d\alpha = \int_{\mathcal{X}_B} d\alpha_B = \alpha_B(\mathcal{X}_B).$ (11)

If $\mu(\mathcal{X}_A) \ne \lambda(\mathcal{X}_B)$, there is no measure $\alpha$ on $\mathcal{X}_{A \cup B}$ satisfying $\alpha_A = \mu$ and
$\alpha_B = \lambda$. In some situations, this problem is not too severe. Using the
alternative definition, we express $\mu = \nu_1 G_1$ and $\lambda = \nu_2 G_2$. The consistency
conditions translate to $G_1$ and $G_2$ being consistent and $\nu_1 = \nu_2$. If only the
second condition fails, then it is still possible to find $G = G_1 \star G_2$. Employing
the stick-breaking representation, we can generate random atoms from $G$. The
problem lies in assigning weights to each atom. However, we can take solace in a
mitigating factor. For density estimation, the value of the prior precision ($\nu$) is
typically small compared to the sample size. Hence, it may be reasonable to simply
scale the precisions so that $\nu_1$ and $\nu_2$ are equal. For these applications, it is
only important that $G_1$ and $G_2$ are consistent. If so, the base measures $\mu$ and
$\lambda$ only need to be proportional to each other over $A \cap B$.

There may be other situations in which scale is important. Unfortunately,
as Equation 11 shows, we cannot find a suitable base measure for the prior
that has both $\mu$ and $\lambda$ as marginals. Without a suitable prior, there can be no
suitable posterior. If the goal is to estimate a distribution and there is genuine
concern about the precision of the prior estimate, then both conditions must be
satisfied. That is, $\mu$ and $\lambda$ must be consistent. Equivalently, $G_1$ and $G_2$ must be
consistent and $\nu_1$ must be equal to $\nu_2$. Cases in which one or both conditions
fail are explored more fully in Appendix A.

Subsequently, we assume that both consistency conditions are satisfied. This
leads to a natural extension of the previous work. We have equated consistency
of base measures with consistency of probability measures. Thus, we generalize
Markov combinations to include consistent finite measures by scaling them
to probability measures, finding the Markov combination, and rescaling the
result.

Definition 12 Markov Combination of Finite Measures. Let $\mu$ be a
finite measure on $(\mathcal{X}_A, \mathcal{F}_A)$. Let $\lambda$ be a finite measure on $(\mathcal{X}_B, \mathcal{F}_B)$ that is
consistent with $\mu$. The Markov combination of $\mu$ and $\lambda$ is denoted $\mu \star \lambda$, where

$\mu \star \lambda = \mu(\mathcal{X}_A) \cdot [\bar\mu \star \bar\lambda] = \lambda(\mathcal{X}_B) \cdot [\bar\mu \star \bar\lambda],$ (12)

where $[\bar\mu \star \bar\lambda]$ is the almost-everywhere unique probability distribution
satisfying Theorem 9.

This definition is a generalization of Definition 10 for probability measures. Note
that the Markov combination defined in this way is unique almost everywhere,
since $[\bar\mu \star \bar\lambda]$ is unique almost everywhere.

It is easy to show that the normalization (bar) and $\star$ operations commute (with
respect to composition).



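Theorem 13 If $\mu$ and $\lambda$ are consistent measures, then
$\overline{\mu \star \lambda} = \bar\mu \star \bar\lambda$.
Proof:

$\overline{\mu \star \lambda} = \frac{\mu \star \lambda}{[\mu \star \lambda](\mathcal{X}_{A \cup B})}$ (13)

$= \frac{\mu(\mathcal{X}_A) \cdot [\bar\mu \star \bar\lambda]}{\mu(\mathcal{X}_A) \cdot [\bar\mu \star \bar\lambda](\mathcal{X}_{A \cup B})}$ (14)

$= \bar\mu \star \bar\lambda.$ □ (15)

Writing the base measures in the alternative notation, set $\mu = \nu G_1$ and
$\lambda = \nu G_2$. Theorem 13 states that $\overline{\mu \star \lambda} = G_1 \star G_2$. Therefore, the Markov
combination of $\nu G_1$ and $\nu G_2$ can be written $\nu(G_1 \star G_2)$.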

4.3. Constructing the Hyper Dirichlet Process

We now apply the idea of Markov combinations to component Dirichlet
processes. To do so, we simply form the Markov combination of the base measures.

Theorem 14 Let $H_1$ be a distribution on $(\mathcal{X}_A, \mathcal{F}_A)$. Let $H_2$ be a distribution
on $(\mathcal{X}_B, \mathcal{F}_B)$ that is consistent with $H_1$. Set $H = H_1 \star H_2$. Let $Q \sim DP(\nu H_1)$,
$R \sim DP(\nu H_2)$, and $\theta \sim DP(\nu H)$ be random probability measures. The following
are true:

(i) $\theta_A \overset{d}{=} Q$
(ii) $\theta_B \overset{d}{=} R$

Proof: The proposition follows from Theorem 5. □



Note that the non-parametric approach is actually simpler than the parametric
approach in one sense. The hyper inverse Wishart is a generalization of the
inverse Wishart to incomplete graphs. However, the distribution of $\theta$ in
Theorem 14 actually is a Dirichlet process with a Markov base measure. Therefore,
previous results regarding the Dirichlet process also apply to the hyper Dirichlet
process. Most importantly, we know that the prior law is a stick-breaking prior.

The hyper inverse Wishart is so-called because it is an example of a hyper
Markov prior (Dawid and Lauritzen, 1993).

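Definition 15 Hyper Markov. Consider an undirected graph, $\mathcal{G}$. Let $\theta \sim \mathcal{L}$
be a random probability measure over $\mathcal{X}$. We say that $\mathcal{L}$ is (weak) hyper Markov
on $\mathcal{G}$ if $\mathcal{L}$ is concentrated on $\mathcal{M}(\mathcal{G})$, and $\theta_A \perp\!\!\!\perp \theta_B \mid \theta_C$ whenever $C$
separates $A$ and $B$.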

It is tempting to refer to the distribution of $\theta$ in Theorem 14 as a hyper
Dirichlet process; however, the given conditions are not sufficient to ensure that
the process is hyper Markov. The next task is to discover the appropriate
conditions under which the hyper Markov property holds. Let $\mathcal{L} = DP(\nu H)$ be a
Dirichlet process law. Let $\mathcal{G}$ be any graph consisting of two cliques, $A$ and $B$,
with separator $C$. Using the stick-breaking construction, let $w = (w_i : i \in \mathbb{N})$
be the random weights and $Z = (Z_i : i \in \mathbb{N})$ be the atoms, which are iid
observations from $H$. We use $Z_{i\Gamma}$ to denote the components of $Z_i$ belonging to a
set $\Gamma$. For example, the marginal of $\theta$ over $A$ is
$\theta_A = \sum_{i \in \mathbb{N}} w_i \delta_{Z_{iA}}$.

Obviously, one condition for hyper Markovity is that $H$ is a Markov measure.
If $H$ is not Markov, then for $Z_i \sim H$, it is not true that
$Z_{iA} \perp\!\!\!\perp Z_{iB} \mid Z_{iC}$. As a result, $\theta_A \not\perp\!\!\!\perp \theta_B \mid \theta_C$.

$H \in \mathcal{M}(\mathcal{G})$ is a necessary condition, but it is not sufficient. This is because
knowledge of $\theta_B$ contains information about the distribution of weights at each
atom. We must ensure that $\theta_C$ contains the same information as well. To see this,
consider an example for which $H_C$ is a point mass. For $H \in \mathcal{M}(\mathcal{G})$, this
implies that $H_A \perp\!\!\!\perp H_B$. Further suppose that $H_A$ and $H_B$ are not point masses.
In this case, $\theta_B$ implies certain constraints on $w$. For example, if each $Z_{iB}$ is
distinct, then the mass at each atom determines the random weights modulo
permutation. Therefore, the second condition for $\mathcal{L}$ to be hyper Markov is that
$\theta_B$ contains no information about $w$ that is not contained in $\theta_C$. To begin, we
use the condition expressed in the next theorem. The condition is sufficient, but
more restrictive than necessary.

Theorem 16 Let $H$ be a base measure on $\mathcal{X}_{A \cup B}$. Let $C = A \cap B$. Set $\mathcal{L} =
DP(\nu H)$ for some $\nu > 0$. Then $\mathcal{L}$ is hyper Markov on $\mathcal{G}_{A \cup B}$ if the following
conditions hold:

1. $H$ is a Markov measure

2. Refinement Condition:

$$Z_{iC} = Z_{jC} \implies Z_{iB} = Z_{jB} \quad \text{a.s.}[H].$$

Proof: Define $A' = A \setminus C$ and $B' = B \setminus C$. Note that $B = C \cup B'$, so
that $Z_{iB} = Z_{jB} \implies (Z_{iC}, Z_{iB'}) = (Z_{jC}, Z_{jB'})$. In other words, the refinement
condition can be expressed equivalently as an "if and only if" statement:

$$Z_{iC} = Z_{jC} \iff Z_{iB} = Z_{jB} \quad \text{a.s.}[H].$$

Consider $\theta \sim \mathcal{L}$. The hyper Markov property has two conditions:

1. $P(\theta \in \mathcal{M}(\mathcal{G})) = 1$, and

2. $\theta_A \perp\!\!\!\perp \theta_B \mid \theta_C$.

The first condition follows from the refinement condition. Let $x = (x_{A'}, x_C, x_{B'})$
be any point in $\mathcal{X}$ such that $\theta_B(x_B) > 0$. That is, there exists some $i$ such
that $Z_{iB} = x_B$. By the refinement condition, $Z_{jC} = Z_{iC} = x_C$ if and only
if $Z_{jB} = Z_{iB} = x_B$. Hence, $\{j : Z_{jC} = x_C\} = \{j : Z_{jB} = x_B\}$. Using the
stick-breaking representation, we write the distribution of $X_A \mid X_B$.



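$$\theta_{A|B}(x_A \mid X_B = x_B) = \frac{\sum_{i : Z_{iB} = x_B} w_i 1\{Z_{iA} = x_A\}}{\sum_{i : Z_{iB} = x_B} w_i} \qquad (16)$$

$$= \frac{\sum_{i : Z_{iC} = x_C} w_i 1\{Z_{iA} = x_A\}}{\sum_{i : Z_{iC} = x_C} w_i} \qquad (17)$$

$$= \theta_{A|C}(x_A \mid X_C = x_C). \qquad (18)$$

Therefore, $\theta \in \mathcal{M}(\mathcal{G})$.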
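The equality of the two conditionals can be checked numerically on a small example. The sketch below assumes a finite (truncated) set of atoms with coordinates $(a, c, b)$ in which $b$ is determined by $c$, so that the refinement condition holds; the atoms, the weights, and the helper name `conditional` are hypothetical illustrations.

```python
from fractions import Fraction


def conditional(weights, atoms, target, given, value):
    """Conditional pmf of the `target` coordinates given `given` == `value`."""
    mass, total = {}, 0
    for w, z in zip(weights, atoms):
        if tuple(z[i] for i in given) == value:
            total += w
            key = tuple(z[i] for i in target)
            mass[key] = mass.get(key, 0) + w
    return {k: m / total for k, m in mass.items()}


# Hypothetical atoms (a, c, b): equal c always comes with equal b.
atoms = [(0, 0, 0), (1, 0, 0), (0, 1, 1), (1, 0, 0)]
weights = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]

# Distribution of the A' = A \ C coordinate given x_B = (c, b) = (0, 0) ...
given_B = conditional(weights, atoms, target=(0,), given=(1, 2), value=(0, 0))
# ... and given only x_C = (c,) = (0,).
given_C = conditional(weights, atoms, target=(0,), given=(1,), value=(0,))

assert given_B == given_C
```

Both conditionals sum over the same index set, which is exactly the step from Equation 16 to Equation 17.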

It remains to show that $\theta_A \perp\!\!\!\perp \theta_B \mid \theta_C$. We begin by writing the marginals of $\theta$
using the stick-breaking representation. Let $\Gamma$ be any subset of $V$.

$$\theta_\Gamma = \sum_{i \in \mathbb{N}} w_i \delta_{Z_{i\Gamma}}. \qquad (19)$$

Let $\mathbf{Z}_\Gamma = \{Z_{i\Gamma}\}$ be the set of unique occurrences among the random atoms. We
refer to an element of this set using an arbitrary index, $Z^*_{i\Gamma}$. Let $m^*_{i\Gamma}$ be the
total mass at that atom;

$$m^*_{i\Gamma} = \sum_{j : Z_{j\Gamma} = Z^*_{i\Gamma}} w_j = \theta_\Gamma(Z^*_{i\Gamma}). \qquad (20)$$

Note that $\mathbf{Z}_\Gamma$ is the support of $\theta_\Gamma$, and $\mathbf{m}_\Gamma$ is the mass at each point
in the support. Thus, there is a bijection between the measure $\theta_\Gamma$ and the
set $\{\mathbf{Z}_\Gamma, \mathbf{m}_\Gamma\}$. That is to say, each is completely identified once the other is
known. The immediate result is that conditioning on one is equivalent
to conditioning on the other.

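Continue by partitioning the support into two sets. Define $\mathbf{Z}^+_\Gamma = \{Z^*_{i\Gamma} :
H_\Gamma(Z^*_{i\Gamma}) > 0\}$ and $\mathbf{Z}^0_\Gamma = \{Z^*_{i\Gamma} : H_\Gamma(Z^*_{i\Gamma}) = 0\} = \mathbf{Z}_\Gamma \setminus \mathbf{Z}^+_\Gamma$. In other words, $\mathbf{Z}^+_\Gamma$ is
the set of support points with strictly positive mass under $H_\Gamma$, and $\mathbf{Z}^0_\Gamma$ is the
set of points that are in the support but have probability zero under $H_\Gamma$. Again,
we specify a particular element in either set with an arbitrary index, e.g. $Z^+_{i\Gamma}$.
Partition $\mathbf{m}_\Gamma$ in the same way. This yields

$$\mathbf{m}^+_\Gamma = \{m^*_{i\Gamma} : H_\Gamma(Z^*_{i\Gamma}) > 0\} = \{m^*_{i\Gamma} : Z^*_{i\Gamma} \in \mathbf{Z}^+_\Gamma\} = \{\theta_\Gamma(Z^+_{i\Gamma})\}. \qquad (21)$$

We stipulate that the index is consistent with $\mathbf{Z}^+_\Gamma$ so that $m^+_{i\Gamma} = \theta_\Gamma(Z^+_{i\Gamma})$.
Denote the other set in this partition by $\mathbf{m}^0_\Gamma = \mathbf{m}_\Gamma \setminus \mathbf{m}^+_\Gamma$, where $m^0_{i\Gamma} = \theta_\Gamma(Z^0_{i\Gamma})$.
Separate the sum in Equation 19 using this partition.

$$\theta_\Gamma = \sum_{i=1}^{\Lambda_\Gamma} m^+_{i\Gamma} \delta_{Z^+_{i\Gamma}} + \sum_{i \geq 1} m^0_{i\Gamma} \delta_{Z^0_{i\Gamma}}, \qquad (22)$$

where $\Lambda_\Gamma = |\mathbf{Z}^+_\Gamma|$. Note that $\mathbf{Z}^+_\Gamma$ has a degenerate distribution. If $H_\Gamma(\gamma) > 0$,
then with probability 1, $\gamma$ will occur infinitely often in $Z_\Gamma$. Therefore, $\mathbf{Z}^+_\Gamma =
\{z_\Gamma : H_\Gamma(z_\Gamma) > 0\}$ almost surely. Since $H$ is known, the sets of summation in
Equation 22 are fully identified by $\{\mathbf{Z}_\Gamma, \mathbf{m}_\Gamma\}$. It follows that conditioning on $\theta_\Gamma$ is
equivalent to conditioning on the quartet $\{\mathbf{Z}^+_\Gamma, \mathbf{m}^+_\Gamma, \mathbf{Z}^0_\Gamma, \mathbf{m}^0_\Gamma\}$. We will now show
that under the refinement condition, $\mathbf{Z}^+_B$, $\mathbf{m}^+_B$, and $\mathbf{m}^0_B$ are fully identified from
$\theta_C$. With that fact, showing $\theta_A \perp\!\!\!\perp \theta_B \mid \theta_C$ is equivalent to showing $\theta_A \perp\!\!\!\perp \mathbf{Z}^0_B \mid \theta_C$.
By the refinement condition,

$$m^+_{iC} = \sum_{j : Z_{jC} = Z^+_{iC}} w_j = \sum_{j : Z_{jB} = Z^+_{iB}} w_j = m^+_{iB}. \qquad (23)$$

A similar equation shows $m^0_{iC} = m^0_{iB}$. Therefore, $(\mathbf{m}^+_C, \mathbf{m}^0_C) = (\mathbf{m}^+_B, \mathbf{m}^0_B)$.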

We now show that $\mathbf{Z}^+_B$ is a function of $\mathbf{Z}^+_C$. This fact ensures that $\mathbf{Z}^+_B$ is fully
identified by $\mathbf{Z}^+_C$ and therefore is conditionally independent of anything given
$\mathbf{Z}^+_C$. One consequence of the refinement condition is that if $H_C(c) > 0$, then
there exists $B(c)$ such that $H_{B|C}(B(c) \mid c) = 1$. This follows from a simple proof
by contradiction. If $H_{B|C}(\cdot \mid c)$ is not a point distribution, then either every point
has probability 0 (as in a continuous distribution), or there is some point with
positive probability strictly less than 1. We will see that neither of these can be
true and conclude the conditional is indeed a point distribution.

Suppose $H_{B|C}(\cdot \mid c)$ has measure zero at every point. With probability $H_C(c)^2 >
0$, the event $Z_{1C} = Z_{2C} = c$ will occur. However, $Z_{1B} \neq Z_{2B}$ almost surely.
Therefore, the refinement condition fails with probability at least $H_C(c)^2 > 0$. Now
suppose there exists $b$ such that $0 < H_{B|C}(b \mid c) < 1$. Then with probability
$H_C(c)^2 H_{B|C}(b \mid c)(1 - H_{B|C}(b \mid c)) > 0$, the events $Z_{1C} = c = Z_{2C}$ and
$Z_{1B} = b \neq Z_{2B}$ will occur. Thus, the refinement condition fails with positive
probability. By these two contradictions, we see that $H_{B|C}(\cdot \mid c)$ must be
a point distribution if $H_C(c) > 0$. We denote the point of concentration by
$B(c)$. Clearly, $c \in \mathbf{Z}^+_C$ implies that $B(c) \in \mathbf{Z}^+_B$. Furthermore, every element
of $\mathbf{Z}^+_B$ equals $B(c)$ for some $c \in \mathbf{Z}^+_C$. This follows from the fact that $C \subset B$, so
$H_C(Z_{iC}) = 0 \implies H_B(Z_{iC}, Z_{iB'}) = 0$. Therefore, $\mathbf{Z}^+_B = g(\mathbf{Z}^+_C) = \{B(c) : c \in \mathbf{Z}^+_C\}$
almost surely.

We have shown that conditioning on $\theta_\Gamma$ is equivalent to conditioning on
$\{\mathbf{Z}^+_\Gamma, \mathbf{m}^+_\Gamma, \mathbf{Z}^0_\Gamma, \mathbf{m}^0_\Gamma\}$. Furthermore, we have that $(\mathbf{Z}^+_B, \mathbf{m}^+_B, \mathbf{m}^0_B) = (g(\mathbf{Z}^+_C), \mathbf{m}^+_C, \mathbf{m}^0_C)$.
This provides an equivalent condition for the independence property that we
want to show. That is, $\theta_A \perp\!\!\!\perp \theta_B \mid \theta_C$ if and only if $\theta_A \perp\!\!\!\perp \mathbf{Z}^0_B \mid \{\mathbf{Z}^+_C, \mathbf{m}^+_C, \mathbf{Z}^0_C, \mathbf{m}^0_C\}$.
The remainder of this proof will show that the second property holds under the
conditions of the theorem.

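Begin by partitioning the atoms and weights as follows. Let $\tilde{Z} = \{Z_i : Z_{iC} \in
\mathbf{Z}^+_C\}$, and $\bar{Z} = \{Z_i : Z_{iC} \in \mathbf{Z}^0_C\}$. Let $\tilde{w} = \{w_i : Z_{iC} \in \mathbf{Z}^+_C\}$, and $\bar{w} = \{w_i :
Z_{iC} \in \mathbf{Z}^0_C\}$. As usual, for $\Gamma \subset V$, let $\tilde{Z}_\Gamma$ and $\bar{Z}_\Gamma$ denote the components of
these elements in $\Gamma$. This partition is similar to, but different from, the partition
defined earlier: $(\tilde{Z}_\Gamma, \bar{Z}_\Gamma)$ depends on $H_C$, whereas $(\mathbf{Z}^+_\Gamma, \mathbf{Z}^0_\Gamma)$ depends on $H_\Gamma$. The
goal, as above, is to rewrite $Z_A$ by partitioning it in a way that preserves the
conditional independence structure. This structure is preserved if the partitioning
function is non-random. In other words, the atoms must be partitioned based
on a known event. When conditioning on $\theta_B$, $\theta_B$ and $\theta_C$ are known, but $\theta_A$
is unknown. Therefore, $(\tilde{Z}_A, \bar{Z}_A)$ provides an observable partition of $Z_A$.

$$\theta_A = \sum_{i : Z_{iC} \in \mathbf{Z}^+_C} w_i \delta_{Z_{iA}} + \sum_{i : Z_{iC} \in \mathbf{Z}^0_C} w_i \delta_{Z_{iA}} = \sum_i \tilde{w}_i \delta_{\tilde{Z}_{iA}} + \sum_i \bar{w}_i \delta_{\bar{Z}_{iA}}. \qquad (24)$$

Note that $\bar{Z}_C$ is equivalent to $\mathbf{Z}^0_C$ by definition, and $\bar{Z}_B = \mathbf{Z}^0_B$ by the
refinement condition. We proceed by showing that $\tilde{w}$, $\tilde{Z}_A$, $\bar{w}$, and $\bar{Z}_A$ are jointly
independent of $\mathbf{Z}^0_B$ given $\{\mathbf{Z}^+_C, \mathbf{m}^+_C, \mathbf{Z}^0_C, \mathbf{m}^0_C\}$. We can express $\mathbf{m}^+_C$ as a function
of $\tilde{w}$, $\tilde{Z}$, and $\mathbf{Z}^+_C$, where

$$m^+_{iC} \overset{a.s.}{=} \sum_{j : \tilde{Z}_{jC} = Z^+_{iC}} \tilde{w}_j. \qquad (25)$$

Furthermore, we have noted that $\mathbf{m}^0_C = \bar{w}$. By the stick-breaking construction,
$\bar{Z} \perp\!\!\!\perp (\tilde{w}, \bar{w}, \tilde{Z})$. Since $\mathbf{Z}^+_C$ is known almost surely, it can be included in the
independence property.

$$\bar{Z} \perp\!\!\!\perp (\mathbf{Z}^+_C, \tilde{Z}, \tilde{w}, \bar{w}). \qquad (26)$$

$\bar{Z}$ is also independent of any function of the RHS of Equation 26. In particular,

$$\bar{Z} \perp\!\!\!\perp (\mathbf{m}^+_C, \mathbf{m}^0_C, \mathbf{Z}^+_C, \tilde{Z}_A, \tilde{w}, \bar{w}). \qquad (27)$$

Repeating this argument on the LHS of Equation 26, we conclude

$$(\mathbf{Z}^0_B, \bar{Z}_A, \mathbf{Z}^0_C) \perp\!\!\!\perp (\mathbf{m}^+_C, \mathbf{m}^0_C, \mathbf{Z}^+_C, \tilde{Z}_A, \tilde{w}, \bar{w}). \qquad (28)$$

Since $Z_i \sim H \in \mathcal{M}(\mathcal{G})$, we can write $\mathbf{Z}^0_B \perp\!\!\!\perp \bar{Z}_A \mid \mathbf{Z}^0_C$. Since all three of these
are jointly independent of $(\tilde{Z}_A, \tilde{w})$ and $(\mathbf{m}^+_C, \mathbf{m}^0_C, \mathbf{Z}^+_C)$, it follows that

$$\mathbf{Z}^0_B \perp\!\!\!\perp (\tilde{Z}_A, \tilde{w}, \bar{Z}_A, \bar{w}) \mid (\mathbf{m}^+_C, \mathbf{m}^0_C, \mathbf{Z}^+_C, \mathbf{Z}^0_C). \qquad (29)$$

A discrete distribution can be defined by a set of points and the probability mass
at each point. Therefore, as in Equation 24, $\theta_A$ is a function of $(\tilde{Z}_A, \tilde{w}, \bar{Z}_A, \bar{w})$.
It follows that

$$\mathbf{Z}^0_B \perp\!\!\!\perp \theta_A \mid (\mathbf{m}^+_C, \mathbf{m}^0_C, \mathbf{Z}^+_C, \mathbf{Z}^0_C). \qquad (30)$$

Hence, by the above argument, it follows that $\theta_A \perp\!\!\!\perp \theta_B \mid \theta_C$. We conclude that $\mathcal{L}$
is hyper Markov. □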



Theorem 16 provides sufficient conditions for a Dirichlet process to be hyper
Markov. Thus, under those conditions, we may safely call the Dirichlet process
a hyper Dirichlet process. When $H$ satisfies the refinement condition, we will
say that $B$ is a refinement of $C$ under sampling almost surely under measure
$H$. It is a refinement in the following sense. Let $X_1, X_2, \ldots$ be an infinite iid
sample from $H$. Form a partition of the natural numbers such that $i$ and $j$ are
elements of the same set if and only if $X_{iC} = X_{jC}$. Call this partition $\lambda(C)$.
Define $\lambda(B)$ by analogy. Under the refinement condition, $\lambda(B)$ is almost surely
a refinement of $\lambda(C)$. We denote this relationship by $\lambda(B) \preceq \lambda(C)$ a.s.[$H$],
omitting $H$ if the measure is contextually evident.
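The partitions induced by a finite sample can be computed directly, which gives a quick empirical check of the refinement condition. The sketch below is illustrative only; the helper names and the two toy samples are hypothetical. Because $C \subset B$, equality on $B$ always implies equality on $C$, so the substantive requirement (the "if and only if" form used in the proof of Theorem 16) is that every block of the $C$-partition also lies inside a block of the $B$-partition, that is, the two partitions coincide.

```python
def induced_partition(sample, coords):
    """Partition the indices 0..n-1 by equality of the chosen coordinates."""
    blocks = {}
    for i, x in enumerate(sample):
        blocks.setdefault(tuple(x[k] for k in coords), set()).add(i)
    return list(blocks.values())


def blocks_contained(inner, outer):
    """True if every block of `inner` lies inside some block of `outer`."""
    return all(any(block <= other for other in outer) for block in inner)


def refinement_condition_holds(sample, c_coords, b_coords):
    """Empirical check of: X_iC == X_jC implies X_iB == X_jB."""
    part_c = induced_partition(sample, c_coords)
    part_b = induced_partition(sample, b_coords)
    return blocks_contained(part_c, part_b)


# Hypothetical samples with coordinates (a, c, b); C = {c}, B = {c, b}.
sample_ok = [(0, 0, 0), (1, 0, 0), (0, 1, 1), (0, 0, 0)]  # b determined by c
sample_bad = [(0, 0, 0), (1, 0, 1)]                        # same c, different b

ok = refinement_condition_holds(sample_ok, c_coords=(1,), b_coords=(1, 2))
bad = refinement_condition_holds(sample_bad, c_coords=(1,), b_coords=(1, 2))
```

A finite sample can only refute the condition, not establish it almost surely, so a check of this kind is best read as a diagnostic for an elicited base measure.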

The refinement condition, as stated in Theorem 16, is sufficient, but it is
stronger than necessary. By symmetry of conditional independence, $\theta_B \perp\!\!\!\perp \theta_A \mid \theta_C$,
even though no refinement condition is needed between $C$ and $A$. It may be
necessary that at least one of the two refinements is present, but this has not
been explored.

The hyper Dirichlet process defined on two cliques is an example of a hyper
Markov combination, which is the analog of Markov combinations for prior
laws. Consider two laws: $\mathcal{Q}$ for $\theta_A$ and $\mathcal{R}$ for $\theta_B$. We say that $\mathcal{Q}$ and $\mathcal{R}$ are
hyperconsistent if the marginal laws they induce for $\theta_{A \cap B}$ are equal. Under this
condition, Dawid and Lauritzen (1993) show that there is a unique hyper Markov law $\mathcal{L}$
such that $\mathcal{L}_A = \mathcal{Q}$, $\mathcal{L}_B = \mathcal{R}$. This is called the hyper Markov combination and
is denoted $\mathcal{L} = \mathcal{Q} \odot \mathcal{R}$.

As with Markov combinations, hyper Markov combinations are easily
generalized to multiple cliques. Let $\mathcal{G}$ be a graph with perfectly ordered cliques
$(C_1, \ldots, C_k)$. Suppose $C_i$ is imbued with a prior law $\mathcal{Q}_i$ and that the priors are
all pairwise hyperconsistent. Let $\mathcal{L}_1 = \mathcal{Q}_1$ and $\mathcal{L}_i = \mathcal{L}_{i-1} \odot \mathcal{Q}_i$ for $i \geq 2$. Then
$\mathcal{L} = \mathcal{L}_k$ is the unique hyper Markov prior satisfying $\mathcal{L}_{C_i} = \mathcal{Q}_i$. We call $\mathcal{L}$ the
hyper Markov combination of $\mathcal{Q}_1, \ldots, \mathcal{Q}_k$. In general we may write $\odot(\mathcal{Q}_1, \ldots, \mathcal{Q}_k)$
with the understanding that the cliques are perfectly ordered and $\mathcal{Q}_1, \ldots, \mathcal{Q}_k$ are
pairwise hyperconsistent.

The next definition generalizes the hyper Dirichlet process to three or more
cliques.

Definition 17 Hyper Dirichlet Process. Let $\mathcal{G}$ be a graph with a perfect
ordering of cliques $C_1, \ldots, C_k$. Suppose that the $i$th clique has marginal
distribution $G_i$ and that the marginals are pairwise consistent. Let $G = \star(G_1, \ldots, G_k)$.
Further suppose that $C_i$ or $\mathcal{H}_i$ is a refinement of $S_i$ under sampling almost
surely under $G$, where $\mathcal{H}_i$ is the $i$th history and $S_i$ is the $i$th separator. Then

$$HDP(\nu, G_1, \ldots, G_k) = DP(\nu G) \qquad (31)$$

is a hyper Dirichlet process prior.

The hyper Dirichlet process defined in this way is guaranteed to be hyper
Markov. Suppose $\mathcal{L} = HDP(\nu, G_1, \ldots, G_k)$. By Theorem 7, $\mathcal{L}_{C_i} = DP(\nu G_i)$ for
$i \geq 2$. Furthermore, it follows from the refinement conditions and Theorem 16
that $\mathcal{L}_{\mathcal{H}_{i-1} \cup C_i} = DP(\nu G_{\mathcal{H}_{i-1}}) \odot DP(\nu G_i)$. Hence, Theorem 3.9 in Dawid and
Lauritzen (1993) states that $\mathcal{L}$ is the almost-everywhere unique hyper Markov
law such that $\mathcal{L}_{C_i} = DP(\nu G_i)$.

The next theorem states that if the prior distribution of 9 is a hyper Dirichlet 
process, then so is the posterior. 

Theorem 18 Posterior Hyper Dirichlet Process. Suppose $\mathcal{G}$ is a graph
with a perfect ordering of cliques $C_1, \ldots, C_k$. Let $\mathcal{L} = HDP(\nu, H_1, \ldots, H_k)$ be a
hyper Dirichlet process. Given $\theta \sim \mathcal{L}$, let $X_1, \ldots, X_n$ be an iid sample from $\theta$.
Denote the marginal value of $X_j$ over the $i$th clique by $X_{ji}$. Then,

$$\theta \mid X_1, \ldots, X_n \sim HDP(\nu + n, H'_1, \ldots, H'_k), \qquad (32)$$

where $H'_i = (\nu + n)^{-1}(\nu H_i + \sum_j \delta_{X_{ji}})$.

Proof: By definition $\theta \sim DP(\nu H)$, where $H$ is the Markov combination of
$H_1, \ldots, H_k$. Therefore, Theorem 7 states that $\theta \mid X_1, \ldots, X_n \sim DP(\nu' H')$, where
$\nu' = \nu + n$ and $H' = (\nu + n)^{-1}(\nu H + \sum_j \delta_{X_j})$. Furthermore, the marginal
distribution of $H'$ over the $i$th clique is $H'_i$. It follows that $H'$ is the unique
Markov combination of $H'_1, \ldots, H'_k$. It remains to show that $DP(\nu' H')$ is a
hyper Markov law via the refinement condition.

For $1 < i \leq k$, let $S_i$ be the $i$th separator and let $\mathcal{H}_i$ be the $i$th history. By
Definition 17, $\lambda(A_i) \preceq \lambda(S_i)$ a.s.[$H$], where $A_i$ is either the $i$th clique or the $i$th
history. Let $Z$ be such that $H'_S(Z) > 0$. We will show that there exists some $a$
such that $H'_{A|S}(\cdot \mid Z) = \delta_a(\cdot)$. From this it follows that $\lambda(A_i) \preceq \lambda(S_i)$ a.s.[$H'$].
We consider the cases $H_S(Z) > 0$ and $H_S(Z) = 0$ separately.

Suppose $H_S(Z) > 0$. By the refinement condition, there exists $a$ such that
$H_{A|S}(\cdot \mid Z) = \delta_a(\cdot)$. Therefore, if $X_{jS} = Z$, then $X_{jA} = a$ with probability one.
Hence,

$$H'_{A|S}(\cdot \mid Z) = \frac{\nu H_S(Z) H_{A|S}(\cdot \mid Z) + \sum_j \delta_{X_{jA}}(\cdot)}{\nu H_S(Z) + \sum_j \delta_{X_{jS}}(Z)} \qquad (33)$$

$$\overset{a.s.}{=} \frac{\nu H_S(Z) \delta_a(\cdot) + \sum_j \delta_{X_{jS}}(Z) \delta_a(\cdot)}{\nu H_S(Z) + \sum_j \delta_{X_{jS}}(Z)} \qquad (34)$$

$$= \delta_a(\cdot). \qquad (35)$$

On the other hand, suppose $H_S(Z) = 0$. Since $H'_S(Z) > 0$, there is some $i$
such that $X_{iS} = Z$. Furthermore, with probability one it holds that $X_{jS} \neq Z$
for all $j \neq i$. Therefore,

$$H'_{A|S}(\cdot \mid Z) = \frac{\nu H_S(Z) H_{A|S}(\cdot \mid Z) + \sum_j \delta_{X_{jA}}(\cdot)}{\nu H_S(Z) + \sum_j \delta_{X_{jS}}(Z)} \qquad (36)$$

$$\overset{a.s.}{=} \frac{\delta_{X_{iA}}(\cdot)}{\delta_{X_{iS}}(Z)} \qquad (37)$$

$$= \delta_{X_{iA}}(\cdot). \qquad (38)$$

The last equation holds because $S \subset A$, so $\delta_{X_{iA}}(\cdot) = 1$ is a stronger condition
than $\delta_{X_{iS}}(Z) = 1$.

From these two cases, we see that if $H'_S(Z) > 0$ then $H'_{A|S}(\cdot \mid Z) = \delta_a(\cdot)$ for
some $a$ which depends on $Z$. Ergo, for $X, Y \sim H'$, $X_S = Y_S$ implies $X_A = Y_A$
almost surely and $DP(\nu' H')$ is a hyper Markov measure. □
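For a finite discrete base measure, the posterior update in Theorem 18 can be written out explicitly. The following sketch is illustrative and makes several assumptions: the base measure is a dictionary over points $(a, c, b)$, the data are taken from its support (so only the case $H_S(Z) > 0$ of the proof is exercised), and the function names and toy values are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def marginal(pmf, keep):
    """Marginalize a pmf keyed by tuples, keeping the positions in `keep`."""
    out = defaultdict(Fraction)
    for x, p in pmf.items():
        out[tuple(x[i] for i in keep)] += p
    return dict(out)


def posterior_base(nu, base, data):
    """Return (nu + n, H') with H' = (nu * H + sum_j delta_{X_j}) / (nu + n)."""
    n = len(data)
    unnormalized = {x: nu * p for x, p in base.items()}
    for x in data:
        unnormalized[x] = unnormalized.get(x, 0) + 1
    return nu + n, {x: Fraction(m) / (nu + n) for x, m in unnormalized.items()}


def determined_by(pmf, s_idx, a_idx):
    """True if each S-value with positive mass fixes the A-value (point-mass conditional)."""
    seen = {}
    for x, p in pmf.items():
        if p > 0:
            s = tuple(x[i] for i in s_idx)
            a = tuple(x[i] for i in a_idx)
            if seen.setdefault(s, a) != a:
                return False
    return True


# Hypothetical base measure on (a, c, b) with b determined by c, and toy data.
H = {(0, 0, 0): Fraction(1, 4), (1, 0, 0): Fraction(1, 4), (0, 1, 1): Fraction(1, 2)}
data = [(0, 0, 0), (0, 1, 1), (0, 1, 1)]
nu_post, H_post = posterior_base(2, H, data)

# Updating the joint and then marginalizing agrees with updating the clique marginal.
_, H_post_A = posterior_base(2, marginal(H, (0, 1)), [x[:2] for x in data])
assert marginal(H_post, (0, 1)) == H_post_A
```

In this example the separator is $\{c\}$ and the clique $\{c, b\}$ plays the role of $A$ in the proof; `determined_by(H_post, (1,), (1, 2))` confirms that the updated base measure still has point-mass conditionals.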

5. Applications 

At first glance, the refinement condition seems unduly restrictive. However, it 
allows one to use hyper Dirichlet processes in most areas that have benefited 
from Dirichlet processes. For the many applications that use a continuous base 
measure, the random atoms are distinct with probability one, so the refinement 
condition is trivial. Furthermore, Theorem 18 states that the posterior will also 
be hyper Markov. Thus, the hyper Dirichlet process can replace the Dirichlet 
process in applications requiring a posterior estimate, as in MCMC. 

The hyper Dirichlet process provides a non-parametric alternative to the hy- 
per Markov laws currently used for problems such as covariance selection for 
fitting graphical Gaussian models. Most previous work has focused on hyper 
inverse Wishart priors over $Q_{\mathcal{G}}$ (Roverato and Whittaker, 1998; Giudici and
Green, 1999; Letac and Massam, 2007). The hyper inverse Wishart is conjugate 
to the hyper Wishart, which is the distribution of the restricted maximum likeli- 
hood estimate for the Gaussian problem with known mean. However, covariance 
selection could also be achieved using hyper Dirichlet processes. This provides 
a non-parametric prior for the covariance matrix. This could be advantageous if 
we wanted to remove the Gaussian assumption. A logical choice is to specify a 
hyper inverse Wishart for the base measure, setting the parameters by empirical 
Bayes. The precision parameter controls the concentration of the prior around 
this measure. Thus, it can loosely be considered a measure of confidence in the 
Gaussian assumption. 

The real power of Dirichlet processes is in modeling mixture distributions.
Suppose $X_1, \ldots, X_n$ are observations from some family parameterized by $\pi$. If
we allow $\pi_i$ to be different for each observation, then the result is a mixture of
distributions. The number of parameters increases with $n$, which necessitates
placing some prior $G$ on the distribution of $\pi$ in order to fit the mixture. If
the prior is unknown, it can be modeled with a Dirichlet process. For example,
Escobar and West (1995) develop a Gibbs sampler to estimate the distribution
of parameters of a Gaussian mixture model. In general, a Dirichlet mixture is a
hierarchical model expressed as

$$\theta \sim DP(\alpha)$$

$$\pi_1, \ldots, \pi_n \mid \theta \sim \theta$$

$$X_i \mid \pi_1, \ldots, \pi_n \sim f(X \mid \pi_i)$$

From Theorem 7, the posterior for $\theta$ given $\pi_1, \ldots, \pi_{n-1}$ is a Dirichlet process
with base measure

$$\alpha(n-1) = \frac{a}{a + n - 1} \bar{\alpha} + \sum_{i=1}^{n-1} \frac{1}{a + n - 1} \delta_{\pi_i},$$

where $a = \alpha(\mathcal{X})$ and $\bar{\alpha} = \alpha / a$ is the normalized base measure. Furthermore,
this is also the distribution of $\pi_n \mid \pi_1, \ldots, \pi_{n-1}$. Thus,
with positive probability, $\pi_n$ will be a previous value of $\pi_i$. Otherwise, it is drawn
from $\bar{\alpha}$. As a result there will be $k \leq n$ unique values of $\pi_i$. This induces a latent
class model for $X$ in which each class is defined by a shared value of $\pi_i$. The
observations are conditionally independent given this latent class. A key feature
of the Dirichlet process is that the number of latent classes is estimated. It is
clear from the form of $\alpha(n)$ that this estimate is influenced by $a = \alpha(\mathcal{X})$. When
$a$ is large, new values of $\pi_i$ will often be drawn from $\bar{\alpha}$. Conversely, when $a$ is
small, $\pi_i$ will more often be drawn from the previous values.
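The sequential predictive rule described above is easy to simulate. The sketch below is only an illustration: the function name `polya_urn`, the standard normal choice for the normalized base measure, and the specific values of $a$ are assumptions made here, not part of the paper.

```python
import random


def polya_urn(n, a, draw_base, rng):
    """Draw pi_1, ..., pi_n sequentially from the predictive distribution."""
    values = []
    for i in range(n):
        # With i previous values, draw a new value from the normalized base
        # measure with probability a / (a + i); otherwise repeat one of the
        # previous values, chosen uniformly at random.
        if rng.random() < a / (a + i):
            values.append(draw_base(rng))
        else:
            values.append(rng.choice(values))
    return values


def draw_base(rng):
    """Hypothetical continuous base measure: a standard normal draw."""
    return rng.gauss(0.0, 1.0)


rng = random.Random(1)
few_classes = polya_urn(50, 1e-12, draw_base, rng)   # very small a
many_classes = polya_urn(50, 1e12, draw_base, rng)   # very large a

k_small = len(set(few_classes))
k_large = len(set(many_classes))
```

With a continuous base measure, the number of distinct values equals the number of latent classes. The two extreme values of $a$ are chosen so that the draws collapse to essentially one class or to essentially $n$ classes, matching the qualitative statement in the text; intermediate values of $a$ give a random number of classes in between.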

This is a natural setting for hyper Dirichlet processes. If $X$ is a multivariate
random variable, then $\theta$ can be a hyper Dirichlet process for some graph. Once
again the observations will be conditionally independent given their latent class.
Furthermore, the components of $X$ will have the independence structure
specified by the graph. For example, Escobar and West (1995) develop an MCMC
algorithm for estimating a Dirichlet mixture of Gaussian distributions. This can
be extended to a mixture of the family $\mathcal{N}_{\mathcal{G}}$ of Gaussians that are Markov on $\mathcal{G}$
by restricting the base measure to $Q_{\mathcal{G}}$. For each update, the posterior of $\theta$ will
be a hyper Markov law.

As a final note, we point out that some of the results apply to other stick- 
breaking measures. Notably, Theorem 16 did not rely on the distribution of the 
random weights. Therefore, the same conditions imply that any stick-breaking 
measure is hyper Markov. That is, if $H$ is Markov and the refinement
condition holds, then a stick-breaking prior whose atoms have distribution $H$ is a
hyper Markov law. Whether or not the posterior is also hyper Markov depends 
on how the measure is updated. For the Dirichlet process, the posterior up- 
date mechanism ensures a hyper Markov posterior as long as the prior is hyper 
Markov.

Appendix A: Working with Non-consistent Base Measures

We constructed the hyper Dirichlet process by combining base measures on each 
clique into a base measure on the entire graph. As we have seen, the end result 
is simply another Dirichlet process. In other words, the hyper Dirichlet process 
is simply a Dirichlet process that is hyper Markov on the graph of interest. The 
benefit is that the elicitation of the base measure is simplified by only considering 
a subset of variables at a time. In the current paper, we considered the case in 
which the individual base measures are consistent (i.e. they agree about the 
intersection.) However, we may not be able to guarantee consistency, especially 
if component base measures are elicited from different experts or models. How 
can this be resolved? 

Consider two cliques, $A$ and $B$, with intersection $C$. Let $\mu$ be a measure on
$\mathcal{X}_A$ and $\lambda$ be a measure on $\mathcal{X}_B$. The two measures are consistent if the marginals
over $\mathcal{X}_C$ are equal. In Section 4.2 we stated that this can be expressed as two
simultaneous conditions: (1) the marginals must be proportional, and (2) the
marginals must have the same scale.

Suppose only the first condition holds. Recall that $\mu(\mathcal{X}_A)$ represents the prior
sample size. The interpretation is that there is more prior information about one
clique than there is about the other clique. Equation 11 shows that there is no
Markov combination of $\mu$ and $\lambda$. That is, $\mu \star \lambda$ does not exist. Thus, there is no
hyper Dirichlet process with those marginals. Fortunately, it is still possible to
generate an "almost appropriate" random distribution. This is possible because
we only need $\overline{\mu \star \lambda}$. By the commutative property, we can use $\bar{\mu} \star \bar{\lambda}$ instead. This
is well-defined since the first property ensures that $\bar{\mu}$ and $\bar{\lambda}$ are consistent. On
the other hand, the difference in scale must be resolved if a problem requires
having a well-defined prior or posterior. The simplest way to achieve this is to
scale one measure up or down to match the other. Additionally, any convex
combination of $\mu(\mathcal{X}_A)$ and $\lambda(\mathcal{X}_B)$ could be a logical choice. The most conservative
choice would be to set $\mu(\mathcal{X}_A) = \lambda(\mathcal{X}_B) = \min\{\mu(\mathcal{X}_A), \lambda(\mathcal{X}_B)\}$. If resources are
sufficient, both scales could be used and the results compared. This would reveal
how sensitive the outcome is to the scale of the base measures.

Now suppose that the first condition does not hold. The interpretation is
that we have conflicting prior information. Once again, $\mu \star \lambda$ does not exist.
Furthermore, $\bar{\mu} \star \bar{\lambda}$ does not exist either, so it is not possible to use the same
method to generate random distributions. In order to find a base measure, one
or both distributions must be changed. There are several natural ways to do
this. Let $U \subset \mathcal{X}_{A \setminus C}$, $V \subset \mathcal{X}_{B \setminus C}$, $W \subset \mathcal{X}_C$.

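1. Choose one base measure, and complete the distribution via conditioning.

$$\alpha_A(U \times V \times W) = \mu(U \times W) \lambda(V \mid W). \qquad (39)$$

$$\alpha_B(U \times V \times W) = \mu(U \mid W) \lambda(V \times W). \qquad (40)$$

2. Calculate a weighted average.

$$\alpha_w(U \times V \times W) = \gamma \alpha_A + (1 - \gamma) \alpha_B, \quad \gamma \in [0, 1]. \qquad (41)$$

If there is no reason to choose one prior over the other, $\gamma = 1/2$ is
appropriate. An interesting choice of $\gamma$ is $\mu(\mathcal{X}_A) / (\mu(\mathcal{X}_A) + \lambda(\mathcal{X}_B))$. This gives
more weight toward the prior with more information.

3. Minimize the summed KL-divergence. Let $\mu_C$ and $\lambda_C$ be the marginals
over $\mathcal{X}_C$.

$$\alpha_C = \operatorname*{argmin}_{\nu} \left[ \int_{\mathcal{X}_C} \mu_C \log \frac{\mu_C}{\nu} + \int_{\mathcal{X}_C} \lambda_C \log \frac{\lambda_C}{\nu} \right]. \qquad (42)$$

$$\alpha(U \times V \times W) = \mu(U \mid W) \alpha_C(W) \lambda(V \mid W). \qquad (43)$$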
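The first two options can be sketched for finite discrete measures as follows. This is a minimal illustration under assumptions: $\mu$ is stored as a dictionary over pairs $(u, w)$ and $\lambda$ over pairs $(w, v)$, both measures put positive mass on the same values of $w$, and the function names and toy numbers are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def complete_by_conditioning(mu, lam, keep):
    """Equation 39 (keep="A") or Equation 40 (keep="B") for discrete measures."""
    mu_c, lam_c = defaultdict(Fraction), defaultdict(Fraction)
    for (u, w), m in mu.items():
        mu_c[w] += m
    for (w, v), m in lam.items():
        lam_c[w] += m
    alpha = {}
    for (u, w), m in mu.items():
        for (w2, v), l in lam.items():
            if w2 != w:
                continue
            # keep="A": mu(u, w) * lam(v | w);  keep="B": mu(u | w) * lam(w, v)
            denom = lam_c[w] if keep == "A" else mu_c[w]
            alpha[(u, w, v)] = m * l / denom
    return alpha


def weighted_average(alpha_a, alpha_b, gamma):
    """Equation 41: gamma * alpha_A + (1 - gamma) * alpha_B."""
    keys = set(alpha_a) | set(alpha_b)
    return {k: gamma * alpha_a.get(k, 0) + (1 - gamma) * alpha_b.get(k, 0)
            for k in keys}


# Hypothetical inconsistent measures: the C-marginals are (3/4, 1/4) and (1/2, 1/2).
mu = {(0, 0): Fraction(1, 2), (1, 0): Fraction(1, 4), (0, 1): Fraction(1, 4)}
lam = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

alpha_A = complete_by_conditioning(mu, lam, keep="A")
alpha_B = complete_by_conditioning(mu, lam, keep="B")
alpha_w = weighted_average(alpha_A, alpha_B, Fraction(1, 2))
```

By construction, `alpha_A` keeps $\mu$ as its marginal over $A$ and `alpha_B` keeps $\lambda$ as its marginal over $B$; the weighted average compromises between the two. The third option requires a numerical optimizer for general measures and is not sketched here.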

More work is needed to test these candidate solutions to form good recom- 
mendations about their use. 



D. Heinz/ Building Hyper Dirichlet Processes for Graphical Models 23 

References 

Bush, C. A. and MacEachern, S. N. (1996). A semiparametric Bayesian 
model for randomized block designs. Biometrika, 83 275-285. 

Carvalho, C., Massam, H. and West, M. (2007). Simulation of hyper-
inverse Wishart distributions in graphical models. Biometrika, 94 647-659.
http://ftp.stat.duke.edu/WorkingPapers/05-03.html.

Dawid, A. P. and Lauritzen, S. L. (1993). Hyper Markov laws in the sta- 
tistical analysis of decomposable graphical models. The Annals of Statistics, 
21 1272-1317. 

Dempster, A. (1972). Covariance selection. Biometrics, 28 157-175. 

Dobra, A., Hans, C., Jones, B., Nevins, J., Yao, G. and West, M. (2004).
Sparse graphical models for exploring gene expression data. Journal of
Multivariate Analysis, 90 196-212.

Escobar, M. D. and West, M. (1995). Bayesian density estimation and
inference using mixtures. Journal of the American Statistical Association, 90 
577-588. 

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. 
The Annals of Statistics, 1 209-230. 

Ghosh, J. K. and Ramamoorthi, R. (1995). Consistency of Bayesian infer- 
ence for survival analysis with or without censoring. In Analysis of Censored 
Data (H. Koul and J. Deshpande, eds.). 95-104. 

Giudici, P. and Green, P. (1999). Decomposable graphical Gaussian model
determination. Biometrika, 86 785-801. 

Griffin, J. E. and Steel, M. F. J. (2006). Order-based dependent Dirichlet 
processes. Journal of the American Statistical Association, 101. 

Holland, P. W. and Leinhardt, S. (1981). An exponential family of proba- 
bility distributions. Journal of the American Statistical Association, 76 33-50. 

Kim, Y. and Lee, J. (2001). On posterior consistency of survival models. The 
Annals of Statistics, 29 668-686. 

Letac, G. and Massam, H. (2007). Wishart distributions for decomposable 
graphs. The Annals of Statistics, 35 1278-1323. 

Liu, J. and Massam, H. (2006). The conjugate prior for discrete
hierarchical log-linear models.
http://www.citebase.org/abstract?id=oai:arXiv.org:math/0609100.

Marrelec, G. and Benali, H. (2006). Asymptotic Bayesian structure learning
using graph supports for Gaussian graphical models. Journal of Multivariate 
Analysis, 97 1451-1466. 

Pattison, P. and Wasserman, S. (1999). Logit models and logistic regression 
for social networks: II. multivariate relations. British Journal of Mathematical 
and Statistical Psychology, 52 169-193. 

Pievatolo, A. and Rotondi, R. (2000). Analysing the interevent time
distribution to identify seismicity phases: A Bayesian nonparametric approach to
the multiple changepoint problem. Applied Statistics, 49 543-562. 

Robins, G., Pattison, P. and Wasserman, S. (1999). Logit models and
logistic regressions for social networks: III. valued relations. Psychometrika,
64 371-394.

Roverato, A. and Whittaker, J. (1998). The Isserlis matrix and its
application to non-decomposable graphical Gaussian models. Biometrika, 85
711-725.

Schervish, M. J. (1995). Theory of Statistics. Springer-Verlag, New York.

Sethuraman, J. (1994). A constructive definition of Dirichlet measures.
Statistica Sinica.

Speed, T. and Kiiveri, H. (1986). Gaussian Markov distributions over finite
graphs. The Annals of Statistics, 14 138-150.

Strauss, D. and Ikeda, M. (1990). Pseudolikelihood estimation for social
networks. Journal of the American Statistical Association, 85 204-212.

Susarla, V. and Ryzin, J. V. (1976). Nonparametric Bayesian estimation
of survival curves from incomplete observations. Journal of the American
Statistical Association, 71 897-902.

Wasserman, S. and Pattison, P. (1996). Logit models and logistic regressions
for social networks: I. An introduction to Markov graphs and p*.
Psychometrika, 61 401-425.



